Automatic identification of variables in epidemiological datasets using logic regression

Authors

  • Matthias W. Lorenz
  • Negin Ashtiani Abdi
  • Frank Scheckenbach
  • Anja Pflug
  • Alpaslan Bülbül
  • Alberico L. Catapano
  • Stefan Agewall
  • Marat Ezhov
  • Michiel L. Bots
  • Stefan Kiechl
  • Andreas Orth
  • Giuseppe D. Norata
  • Jean Philippe Empana
  • Hung-Ju Lin
  • Stela McLachlan
  • Lena Bokemark
  • Kimmo Ronkainen
  • Mauro Amato
  • Ulf Schminke
  • Sathanur R. Srinivasan
  • Lars Lind
  • Akihiko Kato
  • Chrystosomos Dimitriadis
  • Tadeusz Przewlocki
  • Shuhei Okazaki
  • C. D. A. Stehouwer
  • Tatjana Lazarevic
  • Peter Willeit
  • David N. Yanez
  • Helmuth Steinmetz
  • Dirk Sander
  • Holger Poppert
  • Moise Desvarieux
  • M. Arfan Ikram
  • Sebastjan Bevc
  • Daniel Staub
  • Cesare R. Sirtori
  • Bernhard Iglseder
  • Gunnar Engström
  • Giovanni Tripepi
  • Oscar Beloqui
  • Moo-Sik Lee
  • Alfonsa Friera
  • Wuxiang Xie
  • Liliana Grigore
  • Matthieu Plichart
  • Ta-Chen Su
  • Christine Robertson
  • Caroline Schmidt
  • Tomi-Pekka Tuomainen
  • Fabrizio Veglia
  • Henry Völzke
  • Giel Nijpels
  • Aleksandar Jovanovic
  • Johann Willeit
  • Ralph L. Sacco
  • Oscar H. Franco
  • Radovan Hojs
  • Heiko Uthoff
  • Bo Hedblad
  • Hyun Woong Park
  • Carmen Z. Suarez
  • Dong Zhao
  • Pierre Ducimetiere
  • Kuo-Liong Chien
  • Jackie F. Price
  • Göran Bergström
  • Jussi Kauhanen
  • Elena Tremoli
  • Marcus Dörr
  • Gerald Berenson
  • Aikaterini Papagianni
  • Anna Kablak-Ziembicka
  • Kazuo Kitagawa
  • Jaqueline M. Dekker
  • Radojica Stolic
  • Joseph F. Polak
  • Matthias Sitzer
  • Horst Bickel
  • Tatjana Rundek
  • Albert Hofman
  • Robert Ekart
  • Beat Frauchiger
  • Samuela Castelnuovo
  • Maria Rosvall
  • Carmine Zoccali
  • Manuel F. Landecho
  • Jang-Ho Bae
  • Rafael Gabriel
  • Jing Liu
  • Damiano Baldassarre
  • Maryam Kavousi
Abstract

BACKGROUND: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. by using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can reduce the workload and improve data quality. For semi-automation, high sensitivity in recognizing matching variables is particularly important: it allows software to present, for each target variable, a choice of candidate source variables from which a user selects the matching one, with only a low risk that the correct source variable has been missed.

METHODS: For each variable in a set of target variables, a number of simple rules were created manually. With logic regression, an optimal Boolean combination of these rules was searched for each target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). These optimal rule combinations were then validated in a second subset of the database (validation subset).

RESULTS: In the construction sample, 41 target variables were allocated with an average positive predictive value (PPV) of 34% and an average negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% of all variables in the validation sample.

CONCLUSIONS: We demonstrated that the application of logic regression to a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, so backup strategies may be required.
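The workflow in METHODS can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the rules are hypothetical substring tests on variable names, the variable names and labels are invented, and an exhaustive search over one- and two-rule Boolean expressions stands in for the logic regression search, whose details the abstract does not give. It shows how a combination fitted on a construction subset is then scored by PPV and NPV on a validation subset.

```python
from itertools import combinations

# Hypothetical rules for one hypothetical target variable (systolic blood pressure).
# Each rule is a simple yes/no test on a source variable name.
RULES = {
    "sbp": lambda n: "sbp" in n,
    "sys": lambda n: "sys" in n,
    "bp": lambda n: "bp" in n,
    "dia": lambda n: "dia" in n,
}

def literals():
    """Return (label, predicate) pairs for every rule and its negation."""
    out = []
    for key, rule in RULES.items():
        out.append((key, rule))
        out.append((f"not {key}", lambda n, r=rule: not r(n)))
    return out

def candidates():
    """Yield single literals, then AND/OR combinations of two literals."""
    lits = literals()
    yield from lits
    for (la, fa), (lb, fb) in combinations(lits, 2):
        yield f"({la}) and ({lb})", (lambda n, a=fa, b=fb: a(n) and b(n))
        yield f"({la}) or ({lb})", (lambda n, a=fa, b=fb: a(n) or b(n))

def counts(pred, data):
    """Return (tp, fp, fn, tn) for a predicate on (name, is_match) pairs."""
    tp = fp = fn = tn = 0
    for name, is_match in data:
        hit = pred(name)
        if hit and is_match:
            tp += 1
        elif hit:
            fp += 1
        elif is_match:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def ppv_npv(pred, data):
    """PPV = tp / (tp + fp); NPV = tn / (tn + fn); None if undefined."""
    tp, fp, fn, tn = counts(pred, data)
    ppv = tp / (tp + fp) if tp + fp else None
    npv = tn / (tn + fn) if tn + fn else None
    return ppv, npv

def fit(data):
    """Pick the candidate with the fewest misclassifications (first one on ties)."""
    def errors(candidate):
        _, fp, fn, _ = counts(candidate[1], data)
        return fp + fn
    return min(candidates(), key=errors)

# Invented source variable names, labelled True if they match the target variable.
construction = [("sbp_1", True), ("sys_bp", True), ("systolic", True),
                ("dia_bp", False), ("dbp", False), ("age", False), ("sex", False)]
validation = [("sbp_mean", True), ("bp_sys", True), ("bp_dia", False),
              ("system_id", False), ("bmi", False)]

label, best = fit(construction)
print("selected combination:", label)
print("construction PPV, NPV:", ppv_npv(best, construction))
print("validation PPV, NPV:", ppv_npv(best, validation))
```

In this toy data the selected combination fits the construction subset perfectly but produces a false positive in the validation subset (`system_id`), so PPV drops while NPV holds. This mirrors the pattern reported in RESULTS, where NPV is high and PPV is low, though the numbers here carry no meaning beyond the illustration.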

Journal title:

Volume 17, Issue

Pages -

Publication date: 2017